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Abstract 

We propose a general matrix-valued multiple kernel learning framework for high- 
dimensional nonlinear multivariate regression problems. This framework al- 
lows a broad class of mixed norm regularizers, including those that induce spar- 
sity, to be imposed on a dictionary of vector-valued Reproducing Kernel Hilbert 
Spaces |19|. We develop a highly scalable and eigendecomposition-free Block 
coordinate descent procedure that orchestrates two inexact solvers: a Conjugate 
Gradient (CG) based Sylvester equation solver for solving vector-valued Regular- 
ized Least Squares (RLS) problems, and a specialized Sparse approximate SDP 
solver fllSI for learning output kernels. As an application of our framework, we 
show how high-dimensional causal inference tasks can be naturally cast as sparse 
function estimation problems within our framework, leading to novel nonlinear 
extensions of Grouped Graphical Granger Causality techniques. The algorithmic 
developments and extensive empirical studies are complemented by theoretical 
analyses in terms of Rademacher generalization bounds. 

1 Introduction 

This paper is at the intersection of three distinct but symbiotic themes in machine learning and statis- 
tics: (a) non-parametric multivariate regression and structured output learning, (b) sparse learning 
for high-dimensional settings Q, and (c) multiple time series analysis 1 17] and associated temporal 
causal modeling problems 1121111611251 . We begin by considering the general problem of estimating 
an unknown non-linear function f : X ^ y from labeled examples, where the output space y has a 
Hilbert space structure. When y is finite dimensional, the problem is akin to multivariate regression. 
The presence of possible correlations amongst outputs naturally motivates more effective learning 
algorithms that attempt to learn all coordinates jointly instead of treating them independently. Such 
problems can be formulated as regularized risk minimization over a J^-valued hypothesis space of 
functions for which a general Reproducing Kernel Hilbert Space (RKHS) framework has been re- 
cently brought to the attention of machine learning community by |fT9l . However, the Kernel func- 
tion in this setting becomes matrix-valued whose choice turns into a challenging model selection 
problem - certainly much more exaggerated than in the scalar case where Gaussian or Polynomial 
kernels are a default choice that require only a few hyperparameters to be tuned. Compounded fur- 
ther by the computational complexity of solving the resulting optimization problems involving large 
dense matrices, vector-valued extensions of kernel methods are arguably yet to find widespread ap- 
plication, despite the fact that their theory can be traced as far back as the work of Laurent Schwarz 
in 1964 |24|. Our contributions in this paper are as follows: 

o We first seek a scalable resolution of the kernel learning problem in vector-valued settings. To- 
wards this end, we propose a general framework for function estimation over a dictionary of 
vector-valued RKHSs where a broad family of variationally defined regularizers, including spar- 
sity inducing norms, serve to optimally combine a collection of matrix-valued kernels. As such 
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our framework may be viewed as non-parameteric multivariate extension of Group Lasso and 
related sparse learning models Q . 

o An overview on various classes of matrix valued kernels is given in [1]. However, in this paper, 
we restrict attention to separable kernels due to their universality |8|, conceptual simplicity and 
potential for scalability. Separable matrix-valued kernels are composed of a scalar input kernel 
function and an output kernel matrix (to be formally defined later). We focus on vector-valued 
Regularized Least Squares (RLS) algorithms which lead to Sylvester matrix equations of a spe- 
cific form that can be solved in cubic time using appropriate eigendecompositions. These solvers 
are much more efficient than RLS models with general matrix-valued kernels, but nonetheless be- 
come a significant computational bottleneck 111] when invoked repeatedly in the larger context of 
a kernel learning algorithm. To exploit warm starts and additional structure in the problem, we de- 
velop a Conjugate Gradient (CG) based Sylvester equations solver Our block coordinate descent 
approach jointly optimizes input scalar kernel combinations and the output kernel matrix, where 
for the latter we specialize the Sparse SDP solver of [15| for inexact but efficient optimization 
over the cone of positive semi-definite matrices with bounded trace. 

o Empirical results confirm the value of joint optimization of input and output kernels. The use of 
inexact solvers greatly improves the scalability of our algorithms. 

o We provide bounds on Rademacher Complexity for the hypothesis sets considered by our algo- 
rithm. In particular, this extends the results of [lOJ to the vector-valued case. 

o We then turn to multiple time series analysis. We propose to replace classic linear Vector Au- 
toregression (VAR) models traditionally used in time series analysis ( 1 17 1), with sparse nonlinear 
vector- valued RLS models. In particular, we consider the problem of forecasting the evolution of 
a large collection of high-dimensional time series variables, potentially given a small set of obser- 
vations. Various variable groupings, e.g., lagged values of temporal variables, induce a candidate 
pool of kernels from which an optimal small subset is selected by our kernel learning procedures. 
The resulting sparse model can then be naturally endowed with a Graphical Granger Causality 
interpretation. This provides a link between nonparameteric notions of sparsity developed in 
the multiple kernel learning literature, and operational notions of Causality developed by Clive 
Granger [14]. We illustrate the power of our approach on computational biology problems. 



2 Vector- valued RLS & Separable Matrix- valued Kernels 

We begin by providing a brief background and setting up some notation. Let us focus on the Regular- 
ized Least Squares (RLS) framework for finite dimensional multivariate regression settings. Given 
labeled examples {(xj, y.;)} e A" C K'^,yi e 3^ C R" the vector-valued RLS solves the 
following problem, 

argmini^||/(xO-y,|l^ + All/11?, (D 

where T-L-^^ is a vector-valued RKHS generated by the kernel function k , and A > is the regu- 
larization parameter. In the vector-valued setting, fc is a matrix-valued function, i.e., for any pair 

of inputs X, z, k (x, is an n-hy-n matrix, or more generally speaking, an input-dependent linear 
operator on the output space. The kernel function is positive in the sense that for any finite set of m 

input-output pairs {(x^, yj}™ the following holds: X^n^i yf ^ (^ij ^j)yj > 0. A generalized 

representer theorem says that the optimal solution has the form, /(•) — ^ (^«: where 

the coefficients are n-dimensional vectors. For RLS, these coefficient vectors can be obtained 

solving a dense linear system, of the familiar form, + Xllni^ vec{C'^) — vec{Y'^), where 

C = [cxi . . .oti] E M'^" assembles the coefficient vectors into a matrix; the vec operator stacks 

columns of its argument matrix into a long column vector; 1^ is a large nl x nZ-sized Gram matrix 

comprising of the blocks k (x^, Xj), for i,j = 1 . . .1, and !„; denotes the identity matrix of com- 
patible size. It is easy to see that for n — 1, the above developments exactly collapse to familiar 
concepts for scalar RLS (also known as Kernel Ridge Regression). In general though, the above lin- 
ear system requires 0{{nl)^) time to be solved using standard dense numerical linear algebra, which 
is clearly prohibitive. However, for a family of separable matrix-valued kernels defined below, the 
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computational cost can be improved to 0{n^ + 1^) which, though still costly, is comparable to scalar 
RLS when I is much larger than n. 

Separable Matrix Valued Kernel and its Gram matrix: Let k be a scalar kernel function on the 
input space X and K represent its gram matrix on a finite sample. Let li be an n x n positive 

semi-definite output kernel matrix. Then, the function k {x., z) — fc(x, 2)L is positive and hence 

defines a matrix valued kernel. The gram matrix of this kernel /i = K (g) L where ® denotes 
Kronecker product. 

For separable kernels, the corresponding RLS dense linear system (Eqn|2]below) can be reorganized 
into a Sylvester equation (Eqn[3]below): 

(K(8)L + A;i„z)i;ec(C'^) = wec(Y^) (2) 
KCL + AZC = Y (3) 

Sylvester solvers are more efficient than applying a direct dense linear solver for Eqn|2] The classical 
Bartel-Stewart and Hessenberg-Schu£| methods are usually used for solving Sylvester equations. 
They are similar in flavor to an eigendecomposition approach we describe next for completeness, 
though they take fewer floating point operations at the same cubic order of complexity. 

Eigen-decomposition based Sylvester Solver: Let K = TMT^ and L = SNS^ denote 
the eigen-decompositions of K and L respectively where T, S are orthonormal matrices, and 
let the eigenvalues be denoted by "WL = diag{ai . . . ai),lSS = diag{pi . . . pn). Then the so- 
lution to the matrix equation KCL + AC = Y always exists when A > and is given by 

C = TXS where X„ = 

Output Kernel Learning: We will use the shorthand k = fcL to represent the implied separable 
kernel and correspondingly denote its RKHS by HkL,. In recent work 111] develop an elegant ex- 
tension of the vector-valued RLS problem, Eqn. [T] to jointly learn both / e Hki, and L. In finite 
dimensional language, a function /(x) = LCfcx G HkL, is estimated by solving the following 
problem, 

argmin y ||KCL - Y|l^,„ + Airace(C^KCL) + p|lL||^,„ (4) 

where 5" denotes the cone of positive semi-definite matrices. It is shown that the objective function 
is invex, i.e., its stationary points are globally optimal. ifTTI proposed a block coordinate descent 
where for fixed L, C is obtained by solving Eqn. |3] using an Eigendecomposition-based solver 
Under the assumption that C exactly satisfies Eqn. |3] the resulting update for L is then shown to 
automatically satisfy the constraint that L e 5". However, flTi remark that experiments on their 
largest dataset took roughly a day to complete on a standard desktop and that the "limiting factor 
was the solution of the Sylvester equation". 



3 Learning over a Vector- valued RKHS Dictionary 

We seek a fuller resolution of the separable kernel learning problem for vector-valued RLS prob- 
lems, which is eigendecomposition-free and more scalable. In this section, we expand Eqn. |4]to 
simultaneously learn both input and output kernels over a predefined dictionary, and develop opti- 
mization algorithms based on approximate solvers that execute cheap iterations. Consider a dictio- 
nary of separable matrix valued kernels sharing the same output kernel: Pl = {fciL, ■ • ■ ^mL}- 
It is possible for some of the scalar kernels to arise from specific groups of features, i.e. 
ki{x, z) = gi{PiX, Piz) where Pi is a projection from A" to a group of features Xi C X, and 
5i : X I— >■ M is a scalar kernel on Xi. Let 7i{T>j_,) denote the sum space of functions 
T-L{T)-L,) = {f ~ J2j fjlfj ^ ^fcjL, i = 1 . . . m} and equip this space with the following Ip norms: 

\\f\Km-Di.)) = inf/:/=i:,/. IKll/illwfcii.--- Jl/™ll«fc™i.)||p- Note that 11/11 ,^(^^(x,j_)), being the 
li norm of the vector of norms in individual RKHSs, imposes a functional notion of sparsity on the 



'implemented in SLICOT and available in MATLAB via dlyap inbuilt function. 
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vector-valued function /. We now consider objective functions of the form, 

argmin - ^ ||/(x,) - y,||2 + Ar!(/) + r(L) (5) 

/eW(I5L),Le5!f(r) i ^ 

where L is constrained to belong to the spectahedron with bounded trace: 5"(t) — {X e 
5" I trace (X) < r} where S"^ denotes the cone of symmetric positive semi-definite matrices, F 
is a convex differentiable penalty function over L (e.g., p||L||^^q), and i7 is a regularizer whose 

canonical choice is ri(/) ~ \\f\\ip{-H{Vi,))- Our algorithms work for a broad choice of regularizers 
that admit quadratic variational representation of the form 

" llf ll^ 

r!(/) = argmin VH^^+c^(r7) (6) 

for an appropriate function lo : M™ i— > M. We rationalize this framework as follows: Penalty 
functions of the form above define a broad family of structured sparsity-inducing norms that have 
extensively been used in the multiple kernel learning and sparse modeling literature |@l[26l. They al- 
low complex non-differentiable norms to be related back to weighted RKHS norms, and optimizing 
r] in many cases infact admits closed form expressions. Optimizing L over the Spectahedron allows 
us to develop a specialized version of the approximate Sparse SDP solver 1 15] whose iterations in- 
volve the computation of only a single extremal eigenvector of the (partial) gradient at the current 
iterate - this involves relatively cheap operations followed by quick rank-one updates. Furthermore, 
by bounding the trace of L, we show below that a Conjugate Gradient (CG) based iterative Sylvester 
solver for Eqn. |2] would always be invoked on well-conditioned instances and hence show rapid 
numerically convergence (particularly also with warm starts). The trace parameter r also naturally 
appears in our Rademacher generalization analyses. 

First we observe that a basic result concerning sums of scalar RKHSs also holds unsurprisingly for 
the vector- valued case. The proof given in Appendix A follows Section 6 of |3 | replacing scalar 
concepts with corresponding notions from the theory of vector-valued RKHSs 1 19 |. 

Proposition 1. Given a collection of operator-valued reproducing kernels k i . . . fc „ and positive 
scalars rjj > 0,j = 1 . . .m, the function k rj — 'YTi=i Vi ^ i, is the reproducing kernel of the sum 
space % = {f : X ^ 3^|/(x) = X^ili fji^)'fj ^ "^if .} ^'th the norm given by ||/||^^ = 



\\fj\ 



This result combined with the variational representation of the the penalty function in Eqn. |6] 
allows us to reformulate Eqn.|5]in terms of a joint optimization problem over t], L and / G T-Lk^-L, 
where we define the weighted scalar kernel fc^ — J^jVj^'^j- This formulation allows us to 
scale gracefully with respect to m, the number of kernels. Denote the Gram matrix of kr, 
on the labeled data as K^, i.e., = Vj^j^ where Kj we denotes the gram matrices of 

the individual scalar kernel kj . The finite dimensional version of the reformulated problem becomes, 

0(J7, C, L) = i IIK^CL - Y||^^^ + A trace (C^K^CL) + F (L) + lj{t]) (7) 

over C e R"^', L e 5" (t) and ij e R™. The auxiliary function uj{ri) depends on the specific form 
of for Ip norms it is the indicator function of the set r/i > 0, J2"Li vf ^ 1 where ^ + ^ = 1. 

A natural strategy for such a problem is Block Coordinate DescenQ. At termination, the vector- 
valued function returned is /*(x) ~ LC[fc^(x, xi) . . . fcr,(x, x;)]-^. We next describe each block 
minimization problem. 

Conjugate Gradient Sylvester Solver: For fixed ij, L, the optimal C is given by the solution of 
the dense linear system of Eqn|2]or the Sylvester equation |3] with K = K^. General dense linear 
solvers have prohibitive 0{n^l^) cost when invoked on Eqn. [2] The 0{n^ +P) eigendecomposition- 
based Sylvester solver performs much better, but needs to be invoked repeatedly since L as well as 
are changing across (outer) iterations. Instead, we apply a CG-based iterative solver for Eqn|2] 



^Concerning BCD convergence in such a context, see remarks about smoothing operations in |4) 



4 



Despite the large size of the Hnear system, using CG infact has several quantifiable advantages due 
to the special Kronecker structure of Eqn. |2] 

o A CG solver can exploit warm starts by initializing from previous t], L, and allow early termina- 
tion at cheaper computational cost. 

o The large nl x nl coefficient matrix in Eqn|2] never needs to be explicitly materialized. For 
any CG iterate X*^*), matrix- vector products can be efficiently computed since (K^ (g) L + 
A/I„;)uec(X'*^^) = wec(K^X''^L + A/X^*)). CG can exploit additional low-rank or sparsity 
structure in and L for fast matrix multiplication. When the base kernels are either (a) linear 
kernels derived from a small group of features, or (b) arise from randomized approximations, such 
as the random Fourier features for Gaussian Kernel li22J . then — X^jli Vj'^j'^J where Zj 
has dj ^ / columns. In this case, need never be expUcitly materialized and the cost of matrix 
multiplication can be further reduced. 

o CG is expected to make rapid progress in a few iterations in the presence of a strong regularizes 
This is because the coefficient matrix is expected to be well conditioned. Assume that (p = 
maxjsupx^xkj{x,x.) for all j, as in a dictionary of Gaussian Kernels where (p — 1 and that 
X]j Vj ^ 1' which holds for li norm. Then, the condition number (ratio of largest to smallest 

eigenvalue) of the matrix (K^ (8) L + XIT) can be bounded as k = CT,^p^+iA — ^ ^ ^' where 
iTi , pi (and ai , p„) are the largest (smallest) eigenvalues of K and L respectively. Specializing 
standard CG convergence results |6 | to our setting, we see that r, A control how fast CG will make 

(\ k 
j ||C(o) - C*\\fro where C^*^) is the solution at step 

fc, C* is the optimal solution (at current fixed rj and L) and C^"^ is the initial iterate. 

Updates for 77: Existing results |l4l|26l|20| from MKL literature can be adapted to our setting to get 
closed form update rules listed below for some choices of Let aj — rjj yj trace{C'Kj CL) 

where r/j refers to previous value of rjj. For penalty, \\f\\f (•j^(x'l))' '■^^ optimal rjj — 

"J^V (j2"Li ^r') fo"" ^ = 2^- Po"" ™ ^l^s^i'^ s^yl^ penalty, (1 - fi)\\f\\i,CH{Ti^)) + 
t-i-llfWi^cHiv-i^)}' optimal rjj = ■ Other choices are possible, e.g., see Table 1 in f26l. 

Spectahedron Solver: Here, we consider the L optimization subproblem, which is: 
argminLg5„(^) g{L) = i|| AL - Y\\}^^ + Airace(B^L) + r(L) where A = K^C, B ^ C^A. 
Hazan's Sparse SDP solver ifTSl [T2l based on Frank- Wolfe algorithm fO), can be used for prob- 
lems of the general form, L* — argmiriLg^r. trace(L)=i .9(1')' where g is a convex, symmetric 
and differentiable function. In each iteration, it optimizes a linearization of the objective function 
around the current iterate L^, resulting in updates of the form, L^+i = + a^ivkv]: — Lfc) 

where Vk = ApproxEV (Vg(Lk), and ak = min (l, |). Here, ApproxEv is an approx- 
imate eigensolver which when invoked on the gradient of g at the current iterate Lfe (a positive 
semi-definite matrix) computes the single eigenvector corresponding to the smallest eigenvalue, to 
a prespecified precision Cgk~^, where Cg is a curvature constant upper bounded by the largest 
eigevalue of the Hessian of g. Hazan's algorithm is appealing since each iteration itself tolerates 
approximations, the updates pump in rank-one terms, and it comes with the guarantee that after k 
steps, g(Lfe+i) — g(L*) < 8Cgk^^. We specialize Hazan's algorithm it to our framework as follows 
(below, note that A = Kj,C, B = C^A): 

o Using bounded trace constraints, irace(L) < r, instead of unit trace is more meaningful for our 
setting. The following modified updates optimize over 5" (r): L^+i = + akirVkV^^ — L^), 
where Vk is reset to the zero matrix if the smallest eigenvalue is positive. 

o The gradient for our objective is: V.g(L) = G + - diag{G) where G = Vr(L) + AB + 
2A^AL — 2A^Y and diag{ ) assembles the diagnal entries of its argument into a diagonal 
matrix. 

o For r(L) = p||L||^,,^, instead of using Hazan's line search parameter ak, we do exact line 
search along the direction P — rv^Vk ~ T-ik which leads to a closed form expression:Q:fc — 
^ trace(( ;AL Y) AP+pL p+ ^ ABP) ^ ^^^^ valuc cxcccds 1.0, wc Set ak — ^ sincc the objective 
function restricted to a fine is a simple quadratic. 



5 



o The Hessian of g, with r(L) = p||L||j^jj, in vectorized notation is I„2 (g) A^A + /9l„4, whose 
maximum eigenvalue, assuming J^j Vj < 1 as for li norms, can be upper bounded by p plus 
tjj ~ maxj aij where aij refers to the maximum eigenvalue of Kj. Following lfT2l . we get 
Cg < 2t((t^ + p) which implies the bound: g(Lk+i) — g(L*) < 16T{af + p)k~^. 



3.1 Rademacher Complexity Results 



The notion of Rademacher complexity is readily generaUzable to vector-valued hypothesis spaces 
Ipl. Let H be a class of functions f : X y, where Y C K". Let a e E" be a vector 
of independent Rademacher variables, and similarly define the matrix E — [cri, . . . , cri] e M"^'. 
The empirical Rademacher complexity of the vector- valued class H is the function Ri{H) defined 
as TZi{'H) = jEs sup j:^-^Y^\^^ erf f{xi) . For general matrix-valued multiple kernel learning 
algorithms, we consider hypothesis spaces of the form H ~ {f = X^JLi /ii fj ^ ^"^f }' '^^'^ wiih 
the ?p norm = inf/^j^f. \\fi\\-H-g,...,\\f,n\\H^ 

Theorem 3.1. Consider the hypothesis class T-L^ = {f Cz H : \\.f\\ip{-H) ^ 

o For p > 1, the Rademacher complexity of hypothesis class can be upper-bounded as follows: 
Riinl) < where u = ^ 



trace 



trace 



, and q is such that - + - = 1 



o For p > 1 and the special case of separable kernels k i{x., z) — ki{x,z)'L such that 

sup^ ki[x,x.) < n and trace (L) < t we have Ri^H^) < Am^/'^y^^. 
o Forp = 1 we obtain, Ri{'H\) < \\u\\qfor any q > where r]o = ||. 

o For p = 1 and the special case of separable kernels k i{:x., z) — fci(x, 2;)L such that 



supxfci(x, x) < K and trace(L) < t we have Ri{H\) < 



A^r)o2elog(m) 



The proofs are provided in Appendix B. The above results straightforwardly lead to generalization 
bounds. They extend the results of IJOJ . 



4 High Dimensional Non-linear Causal Inference 



Here, our goal is to show how high-dimensional causal inference tasks can be naturally cast as 
sparse function estimation problems within our framework, leading to novel nonlinear extensions 
of Grouped Graphical Granger Causality techniques (see |25 16| and references therein). In this 
setting, there is an interconnected system of N distinct sources of high dimensional time series data 
which we denote as xj e W''^ ,i ^ 1 . . . N. The system is observed from time t — Ito t — T, and 
the goal is to infer the causal relationships between the sources. Let G denote the adjacency matrix 
of the unknown causal interaction graph where Gij = 1 implies that source i causally influences 
source j. In 1980, Clive Granger gave an operational definition for Causality: 

Granger Causality: A subset of sources Ai = {j : Gij — 1} is said to causally influence source i, 
if the past values of the time series collective associated with the source set Ai is predictive of the 
future evolution of the time series associated with source i, with statistical significance, and more so 
than the past values ofi alone. 

A practical appeal of this definition is that it links causal inference to prediction, with the caveat 
that causality insights are bounded by the quality of the underlying predictive model. Furthermore, 
the prior knowledge that the underlying causal interactions are highly selective makes sparsity a 
meaningful prior to use. Prior work in this direction has focused on linear models ||25l [161 while 
many, if not most, natural systems often involve nonlinear interactions. To apply our framework 
to such problems, we model the system as: xj = (x^ , X(_]^ . . . X(_;, . . . , , ^t^-i ■ ■ ■ x^;), 
where I is a lag parameter and /* e 'H(r>L') where is the output kernel matrix, and we work 
with a dictionary of subsystem-specific input kernels: V — U^]^{fcrsL}"!^j^ where krs is one of a 
choice of rrir scalar kernel functions that only depend on the past values of source r, i.e., x^ . . . x[_; . 
Then, by imposing /Mncf/ona/ sparsity in the estimation of /* by solving Eqn.|5]using 
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regularizer, we get a novel nonparameteric implementation for Granger Causality. In particular, 
if 77*^ corresponds to the weight on kernel krs for the estimated function then we set Gij — 
Y^T=i Vjs- addition to recovering the causal graph in this way, the estimated output kernel matrix 
captures the within-source temporal dependencies. 

5 Empirical Studies 

VAR Modeling in Financial Time Series Analysis: We start with a small dataset of weekly log 
returns of 9 stocks from 2004, studied in ll28l l23l in the context of linear multivariate regression 
with output covariance estimation techniques. We consider first-order vector autoregressive (VAR) 
models of the form xt = f{xt-i) where xt corresponds to the 9-dimensional vector of log-returns 
for the 9 companies at week t and the function / is estimated by solving Eqn. |5] Our experimental 
prototcol is exactly the same as [28, 23 1: data is split evenly into a training and a test set and the 
regularizaton parameter A is chosen by 10-fold cross-validation. All other parameters are left at their 
default values. We generated a dictionary of 117 Gaussian kernels defined by univariate Gaussian 
kernels on each of the 9 dimensions with 13 varying bandwidths. Results are shown in Figure [T] 
(table on the left) where we compare our methods in terms of mean test RMSE against standard 
linear regression (OLS) and linear Lasso independently applied to each output coordinate, and the 
sparse multivariate regression with covariance estimation approaches of ll23l l28l . labeled MRCE 
and FES respectively. We see that joint input and output kernel learning (labeled lOKL) yields the 
best return prediction model reported to date on this dataset. As expected, it outperforms models 
obtained by leaving output kernel matrix fixed as the identity and only optimizing scalar kernels 
(IKL), or only optimizing the output kernel for fixed choices of scalar kernel (OKL). Of the 117 
kernels, 13 have 97% of the mass in the learnt scalar kernel combination. In Figure [T] we also 
show the learnt output kernel L, which notably captures strong similarity between large automobile 
companies; Ford and GM. 
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Figure 1: RMSE (top) and Output kernel L (bottom) 

Scalability and Numerical Behaviour; Our main interest here is to observe the classic tradeoff 
in numerical optimization between running few, but expensive steps versus executing several cheap 
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iterations. We use a 102-class image categorization dataset - Caltech-101 - which has been very 
well studied in the multiple kernel learning literature ifTTl l27l [T3]| . There are 30 training images per 
category for a total of 3060 training images, and 1355 test images. Targets are 102 -dimensional 
class indicator vectors. We define a dictionary of kernels using 10 scalar-valued kernels precom- 
puted from visual features and made publically available by the authors of |27|, for 3 training/test 
splits. From previous studies, it is well known that all underlying visual features contribute to object 
discrimination on this dataset and hence non-sparse multiple kernel learning with lp,p > 1 norms 
are more effective. We therefore set p = 1.7 and A = 0.001 without any further tuning, since their 
choice is not central to our main goals in this experiment. We vary the stopping criteria for our 
CG-based Sylvester solver (cge) and the number of iterations (sdpiter) allowed in the Sparse SDP 
solver, for the C and L subproblems respectively. Note that the closed form rj updates for Ip norms 
take negligible time. We compare our algorithms with an implementation in which each subprob- 
lem is solved exactly using an eigendecomposition based Sylvester solver for C, and unconstrained 
updates for L developed in [1 1 J, respectively. To make comparisons meaningful, we set r to a large 
value so that the optimization over L S 5" (r) effectively corresponds to unconstrained minimiza- 
tion over the entire psd cone 5". In Figure |2l we report the improvement in objective function and 
classification accuracy as a function of time (upto 1 hour). We see that insufficient progress is made 
in both extremes: when either the degree of inexactness is intolerable (c^e — 0.1, adpuer = 100) 
or when subproblems are solved to very high precision (cg^ = le — 6, sdpuer — 3000). Our 
solvers are far more efficient than eigendecomposition based implementation that takes an exorbi- 
tant amount of time per iteration for exact solutions. Approximate solvers at appropriate precision 
(e.g., cge = 0.01. sdpiter = 1000) make very rapid progress and return high accuracy models in 
just a few minutes. In fact, averaged over the three training/test splits, the classification accuracy 
obtained is 79.43% ± 0.67 which is highly competitive with state of the art results reported on this 
dataset, with the kernels used above. For example, [27J report 78.2% ± 0.4, |[l3l report 77.7% ± 0.3 
and mU report 75.36%. 



(a) Objective function versus Time 



(b) Accuracy versus Time 
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Figure 2: Caltech 101 Image Categorization 



Causal Inference of Gene Networks: We use time-course gene expression microarray data mea- 
sured during the full life cycle of Drosophila melanogaster |2|. The expression levels of 4028 genes 
are simultaneously measured at 66 time points corresponding to various developmental stages. We 
extracted time series data for 2397 unique genes, and grouped them into 35 functional groups based 
on their gene ontologies listed in Figure Ob) together with the number of genes in parenthesis. 
The goal is to infer causal interactions between functional groups, as well obtain insight on within- 
group relationships between genes. This is an instance of the setting described in Section |4] We 
conducted 4 sets of experiments: with linear and nonlinear dictionaries (Gaussian kernels with 13 
choices of bandwidths per group), and with or without output kernel learning. We use the parame- 
ters A = 0.001 and time lag of 7. Hold-out RMSE from the four models is shown in Figure[3](a) for 
the 35 groups. Clearly, nonlinear models with both input and output kernel learning give the best 
predictive performance implying greater relability in the implied causal graphs. In consutation with 
a professional biologist, we analyzed the causal graphs uncovered by our approach. The difference 
between the graphs uncovered by linear and nonlinear models (shown in Figure [3] (c) and (d)) is 
intriguing. In particular our nonlinear model uncovered the centrality of a key cellular enzymatic 
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(a) Time Series Prediction Performance 

T 



(b) Function Gene Groups 
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Figure 3: Causal Inference in Gene Networks 



activity, ttiat of tielicase, wtiicti was not recognized by tlie linear model. In contrast, the central 
nodes in the linear model are related to membranes (lipid binding and gtpase activity). Nucleic acid 
binding transcription factor activity and transcription factor binding are both related to the heiicase 
activity, which is consistent with biological knowledge of them being tightly coupled. This was not 
captured in the linear model. Molecular chaperone functions, which connect ATPase activity and 
unfolded protein binding, was successfully identified by our model, while the linear model failed to 
recognize its relevance. It is less likely that unfolded protein and lipid activity should be linked as 
suggested by the linear model. In addition, via output kernel estimation, our model also provides 
insight on the conditional dependencies within genes for each functional group, e.g., FigOe) for the 
unfolded protein binding group. 

Acknowledgments: We thank Haim Avron, Satyen Kale, Rick Lawrence, Bonnie Ray and Tara 
Sainath for several technically insightful conversations on this work. 
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Appendix A. 

Proposition 1 (Vector-valued extension of sum of reproducing kernels (section 6) Theorem in (3)). Let 
k I . . . k p be operator-valued reproducing kernels of a dictionary of RKHSs T) = {T-L-^ ^ . . . T-L-^ } map- 
ping X v-^ y with respective norms \\ ■ Wh-^ ■ • • II ' ■ Then k -p = X]r=i ^ »• ^'th Xi > 0,i = 1 ■ . .p, 
is the reproducing kernel of the space T-L-v ~ © ... © Ti.-^ with the norm \\ ■ Wh-t) given by 

Wffn^^ argmin ^ (8) 



10 



Proof. First, note that if the Theorem holds for p = 2, it can be inductively applied to furnish a proof for a 
general p. So without loss of generality, we assume p = 2. As in Section 6 of [51 or Theorem 5 in [5J, we start 
by introducing the product space. 



K 1 k • 



and an inner product on defined by. 



Ai « 1 A2 ^2 

Define the map u : J- 1-^ H-d by /^) = + f^- Due to completeness of H-^^ , H-ji^, the map's Icernel 
A'^ := u~^(0) is a closed subspace of T and thus its orthogonal complement is also a closed subspace. We 
can consider A*'^ as a Hilbert space with the inner product that is the natural restriction of (•, •)jr to A'^. Define 
V : I— >■ T-L-D as the restriction of it to A^^. Then « is a bijection, and we define an inner product on T-Lv by 

In other words, T-L-d is a Hilbert space isomorphic to A'^^ . Fix any / G T-ixi and note that u~^{f) = {v~^{f) + 
n\n e N}. Since v'^if) and N are orthogonal, it is clear by the Pythagorean theorem that v ^ (/) is the 
element of u^^(f) with minimum norm. Thus, 

1 f2M|2 _ ll/'^lllii I ll/"'"ll«2 



{f\f^)<^u-^f Ai A 

We see that the inner product on "Hi) induces the norm claimed in the theorem statement. 

For any x E X and y G [V, it is clear that fc e> is a symmetric positive semi-definite function since its a conic 
combination of base kernels, and that k D(x,-)y = \\ k \{x , ■)y + \2 k 2{x , ■)y G HiOHi = T-Ld- So 
we just need to verify the reproducing property: (/, k t>[x, ■)y)rLT> ~ y)y- Towards this, take any 

/ G and let (f^,f^) = Also, let ^Ti'^T^^j = (^■D{x,-)y^. Next observe that, 

{ji^ - \i~tx{x, ■)y + - \2~t2{x, ■)y^ = 'tvix, ■)y - ~t-D{x, ■)y = 



Hence, (^h^ — \i k i{x, ■), /i ^ — A2 fc 2{x, ■)jy& N and therefore its inner product in T with (/^, /^) 
equals 0. We have, 

{{f\f), - \iti{x, •), 7^' - A2^2(a;, ■) y)T = 
^(/\ (it' - Xi^iix,-)) y)n^^ + j-{f, (it^ - \2t2{x, ■)) yU^^ = 
±{f\lt')n^ +±{f,t'')n^ ={f\t,{x,-)y)n^ + {f ,t\x,-)y)n^ (9) 

Al ''1A2 *2 ''I '=2 

We will use the last equality in the final steps below together with reproducing properties of 1 and k 2 ■ 



{f k'v(x,-)y}nT^ = {v ^if),v '( kT>{x,-)y)) 



T 



{{f\f%{lt\lt'))^ 



Al 1 A2 '=2 

= {.f,~^iix,-)y)n-. +{f\lt^{x,-)y)n-* 

k I k 2 

= {f(x),y}y + {fix),y)y 
= {f{x),y)y 

Appendix B: Rademacher Complexity 

The notion of Rademacher complexity is readily generalizable to vector- valued hypothesis spaces ilSl . Let 
?^ be a class of functions f : X ^ y, where Y C K". Let cr G K" be a vector of independent Rademacher 
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variables, and similarly define the matrix E = [<ti, . . . , <T(] £ R"^'. The empirical Rademacher complexity 
of the class H is the fimction Ri (W) defined as 



sup V crjfixi) 



We first focus on vector-valued RKHS and obtain the following result. 

Theorem 5.1. Let T-L be a vector-valued RKHS with associated kernel k . Consider the hypothesis class 
Hx = {f £ H. : ll/llli < A}. The Rademacher complexity of hypothesis class Hx can be upperbounded as 
follows. 



Ri{Hx) < 



Xtrace 



I 



where K is the Gram matrix of the kernel k . For the special case of separable kernels k (x, z) = fc(x, z)L 
such that 8wp^ A;(x, x) < /t and trace{L) <t we have 



Rii-Hx) < 



Xkt 



Proof. Recall that by the Reproducing property we have cr /(x) = (/, k (x, ■)(t)h- Then we get 



Ri{Hx 



< 



< 
< 

< 



sup {/, V k (x, ■)cTi}n 



< - sup ||/|1hEe 



' feu 



Vx 



Ex 



Vx 



\ \i=l 



Vx 



I 

Vx 



Istrace 



ij Xtrace(^) 



I 



Now for the special case of separable kernels we obtain 



Mnx)<J^. 



We now consider the multiple kernel learning case. 

Theorem 5.2. Let H = {f ^ J2"Li fj J j S H-j^ },and define the i 



□ 



\\f\\t{H)= ini 

J—A^j Jj 

Consider the hypothesis class 1-0^ = {f £ H : \\f\\ij,('H) ^ X}.Forp > 1, the Rademacher complexity of 
hypothesis class can be uperbounded as follows. 
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where u = 



\Jtrace(^i), . . . , \J trace(^m) 



and q is such that - + i = 1. 



For the special case of separable kernels fci(x,z) = ki(^,zyL such that swp^ki{^,'x) < nandtrace^L) < 
T we have 



Ri{H\)<\7v}'\l^. 



For p = Iwe obtain. 



for any g > 0. 



Mn\) < ^lluii 



For the special case of separable kernels /ci(x,z) = ki{T(.,z)'L such that swp^ki{K.,-x) < K,andtrace{L) < 
T we have 



P ,0,1-, ^ A V7?o2e log (m)Kr 
Ri{H,) < ^= 



Proof. We first focus on p > 1. We again make use of the reproducible property of kernels. 



RiiK) = T^s 



< 



I m 
sup ^O-f ^/j (Xi) 
/SKj i=l j = l 



< -E. 



sup ^j('^'-)<^i>W 



sup Vll/ilkj 



1 

7 sup ^II/. K-Ee 



7 sup ^||/j||k,V*^^^^^(^ 



/l,...,/m:||ll/lllWi,-,ll/n.llw„»|L<A j 



Let u : 



\Jtrace(^i), . . ., \J trace(^m) 



and u = II , ■ ■ ■ , ll/m ||-Hm] ■ With this notation we have 



Ri{H\) < 7 sup v^n 
' \Mp<\ 

< y sup lll-llpllullg 



< 



Allull 



where the first inequaUty is a direct consequence of Holder's inequalty. 

X)^iV/kt J and thus conclude 
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We now focus on the special case where p = 1 

I m 

sup ^o-f ^/j(xi) 



Let 



Riini) = jEj, 



< ^E. 



m I ^ 

sup ^{f3,Yl kj{x,-)ai)n 



sup ^ 



5Z 

1=1 

^'^m(x,-) 



and = [ll/illwii • • • 1 ll/"i||«„J ■ With this notation we have 

Ri{Hx) < jEe sup v^w 

i 111,11, ^ i 



< -Ee sup ||g for any p, g : 1/p + 1/g = 1 



||^||l<A 



< yEslliwIlq using||v||p < ||v||i 



yEE 



El 

3 = 1 



< - 



m 

^Ee^,! 
J=l 



1/9 



1/9 



< - 



< - 



^Ee (^trace(E^i^jE) 

j=i 



U=i 



9/2 



1/9 



1/9 



m 



9/2 



J = l 



1/9 



where 770 = 23/22. Here the second inequahty follows by Holder's inequalty, the fourth inequality fol- 
lows by Jensen's inequality, the fifth inequality can be obtained by a similar reasoning as in the single 
kernel case, and the last inequality follows by straightforward extension of Lemma 1 in |10|. Let u = 

"\J trace(^i) , . . . , trace(S.m) ■ We obtain, for any q > 0., 

Ri{Hx) < — l|u|U 

/ 1\ ^ 

Now for the special case of separable kernels we have ||u||q < I X^I^i vIi^t j and thus 

Ri(rt\) < 



Noting that for m > 1 the function q — )■ qm ' ' reaches its minimum for q — 2 log(m), we get 

7i 



□ 
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